ABSTRACT

Graphical models have been widely applied in solving
distributed inference problems in sensor networks. In this
paper, the problem of coordinating a network of sensors to
train a unique ensemble estimator under communication
constraints is discussed. The information structure of
graphical models with specific potential functions is employed
to convert the collaborative training task into a problem of
local training plus global inference. Two important classes of
graphical model inference algorithms, message-passing and
sampling, are employed to tackle low-dimensional, parametrized
problems and high-dimensional, non-parametrized problems,
respectively. The efficacy of this approach is demonstrated by
concrete examples.

1. INTRODUCTION 

It is widely recognized that distributed inference methods
developed for graphical models comprise a principled approach
for information fusion in sensor networks (see [1]). With
powerful graphical model (GM) inference tools at hand, the
similarity between sensor networks and graphical models
compels researchers to model sensor network problems using
graphical models so that the well-studied GM algorithms,
including sum-product, sampling and variational algorithms,
can be applied to sensor networks.

Although this analogy seems to be simple, the map from
sensor network problems to graphical models is not
straightforward. As pointed out in [1], it is the informational
structure of the distributed inference problem, involving the
relationships between sensed information and the variables
about which we wish to perform estimation, that is just as
critical as the communication structure of the problem. How
to model distributed inference problems as graphical models
is as important as solving the problem itself. A wide range of
distributed inference problems have been reformulated with
graphical models, including self-localization [2], multi-object
data association and tracking [3], [4], distributed hypothesis
testing [5], and nonlinear distributed estimation [6].



One of the advantages of modeling distributed infer- 
ence problems in sensor networks as inference problems on 
graphical models is to find communication-efficient "mes- 
sages" that are exchanged among the sensors. In many ad- 
hoc algorithms for distributed inference problems, the mes- 
sages transmitted among the sensors are problem-specific. 
If we can successfully model these problems as graphical 
models, the messages exchanged among the sensors turn 
out to be exactly the messages specified by the correspond- 
ing graphical model message-passing algorithms such as the 
sum-product algorithm. 

Another issue, much more important in the area of wire- 
less sensor networks than in graphical models, is the com- 
munication cost. In wireless sensor networks, communi- 
cation is usually constrained and expensive, unlike in cen- 
tralized inference algorithms, where message-passing is al- 
most free. This difference leads to totally different opti- 
mization objectives in these two areas, even for exactly the 
same graphical models. In sensor network problems, we 
look for inference algorithms with more local computation 
and less message-passing; in graphical models, we are inter- 
ested in algorithms that minimize the computational com- 
plexity. This brings us new problems like how to decrease 
the amount of data exchanged among the sensors, as de- 
scribed in HI, Q, 0, and 0. This problem is unique for 
graphical models applied to distributed inference problems. 

There are many distributed inference problems that have
not been described in the language of graphical models.
One type of such problem falls into the category of
distributed learning/collaborative training, as described by
Predd, et al., in [10], [11], and [12]. In these problems, the
sensors collaboratively train their individual estimators so as
to minimize the training error of a kernel regression, subject
to consistency of prediction on shared data. In [11], an
iterative algorithm is designed to achieve this training goal.

In our paper, we aim to model the distributed training
problem as an inference problem on graphical models. In
these settings, independent and identically distributed (i.i.d.)
data are collected by different sensors, which are able to
communicate with each other under some constraints. The
sensors, without directly sharing their training data (usually
high-dimensional, confidential and in large amount), attempt
to collaboratively find a good ensemble classifier/estimator.
In our framework, we manage to transform the collaborative
training problem, usually solved by an ad-hoc design (e.g.
alternating projection) in a sensor network, into an inference
problem on a graphical model combined with local training.
This conversion from ad-hoc collaborative training to
local training plus collaborative inference is due to the
application of the graphical model on a functional space of
classifiers/estimators. The problem of selecting the optimal
estimator is converted into a maximum a posteriori
probability (MAP) problem on a graphical model, with each
random variable supported on a functional space to which the
estimators belong.



2. FROM COLLABORATIVE TRAINING TO 
GRAPHICAL MODELS 

The self-localization problem in [1] provides us with an
excellent example of how to convert a distributed inference
problem in a sensor network into an inference problem in
a graphical model. Similar to this scheme, we design our
graphical model for the collaborative training task to
convert a "global collaborative training" problem into a
"local training plus global inference" problem.

To make our model more concrete, we first illustrate the
system in Fig. 1, where there are 6 sensors with limited
communication capability.




Fig. 1. A typical sensor network and its corresponding 
graphical model abstraction. 

In Fig. 1 we abstract the sensor network into a graph.
The edges among the nodes represent the condition that the
nodes are able to communicate with each other. Now, we
assume that each sensor $s$ maintains a distribution of
estimators, i.e., a random variable $F_s$ supported on a
functional space $\Omega$, with probability distribution
$p_s(f_s)$.

Then, we assign a potential $\sigma_{s,t}(f_s, f_t)$ to the edge
between two connected sensors $s$ and $t$. Here, $\sigma_{s,t}$
encodes a requirement on the similarity between the estimators
maintained by adjacent sensors.

Based on these assumptions, the potential of the entire
graphical model is of the form

$$p(f) = \prod_{s \in \mathcal{V}} p_s(f_s) \prod_{(s,t) \in \mathcal{E}} \sigma_{s,t}(f_s, f_t). \qquad (1)$$

If the graph is loopless, a potential of the form (1) is a
standard form to which message-passing algorithms, such as the
sum-product algorithm, can be applied. These algorithms enable
us to find the marginal distribution/MAP of the estimator at
each sensor in a distributed way.

However, it is quite common that loops exist in the sensor
network, yet it is very difficult to do triangulation in
a system without centralized computation. To allow the
use of message-passing algorithms, Willsky, et al. apply
loopy belief propagation described in [13] and [14].
Although Jordan and Murphy in [14] have shown some cases
where loopy belief propagation might lead to erroneous
results, Willsky et al. have proven that under certain
conditions, loopy belief propagation is a contractive map (for
some specially defined distances) and hence converges to
a unique limit. Therefore, loopy belief propagation can be
applied to make inferences on the graphical model with
potential (1). Moreover, sampling algorithms are not affected
by loops, and can always be applied readily.

Under our framework, the messages or samples passed 
among the sensors are no longer data instances, but distribu- 
tions or individual samples of functions. Thus, the sensors 
are no longer working with individual data points, but sum- 
maries of data - trained classifiers/estimators. Moreover, the 
message-passing algorithm ends in finite steps for loopless 
graphs, which requires no iteration. These advantages are 
due to the introduction of the graphical model. 
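
As a rough illustration of inference on a loopless graph with a potential of the form (1), the following sketch runs sum-product on a hypothetical three-sensor chain and compares the result with brute-force marginalization. The candidate set, the numbers in `node_pot`, and the relaxed similarity `sigma` (with its `eps` parameter) are assumptions made for this example only.

```python
import itertools

# Hypothetical toy setup: 3 sensors on a chain 0 - 1 - 2, each holding one of
# K candidate estimators (indexed 0..K-1). All numbers are illustrative.
K = 3
edges = [(0, 1), (1, 2)]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
node_pot = {  # p_s(f_s): local belief of each sensor over the candidates
    0: [0.6, 0.3, 0.1],
    1: [0.2, 0.5, 0.3],
    2: [0.3, 0.4, 0.3],
}


def sigma(a, b, eps=0.05):
    """Relaxed similarity potential: 1 if equal, eps otherwise (assumed form)."""
    return 1.0 if a == b else eps


def message(t, s, cache):
    """Sum-product message from sensor t to sensor s (recursion valid on a tree)."""
    if (t, s) in cache:
        return cache[(t, s)]
    msg = []
    for xs in range(K):
        total = 0.0
        for xt in range(K):
            term = sigma(xs, xt) * node_pot[t][xt]
            for u in neighbors[t]:
                if u != s:
                    term *= message(u, t, cache)[xt]
            total += term
        msg.append(total)
    cache[(t, s)] = msg
    return msg


def marginal(s):
    """Normalized marginal at sensor s from its local potential and messages."""
    cache = {}
    belief = list(node_pot[s])
    for t in neighbors[s]:
        m = message(t, s, cache)
        belief = [b * mi for b, mi in zip(belief, m)]
    z = sum(belief)
    return [b / z for b in belief]


def brute_force_marginal(s):
    """Reference marginal obtained by enumerating all joint assignments."""
    belief = [0.0] * K
    for assign in itertools.product(range(K), repeat=3):
        w = 1.0
        for v in range(3):
            w *= node_pot[v][assign[v]]
        for a, b in edges:
            w *= sigma(assign[a], assign[b])
        belief[assign[s]] += w
    z = sum(belief)
    return [b / z for b in belief]
```

On a tree the two functions should agree up to floating-point error, which is the sense in which message-passing is exact for loopless graphs.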

3. LOCAL TRAINING AND GLOBAL INFERENCE 

Based on the model described in the previous section, the 
problem now can be separated into two stages: 

1. Local training: find reasonable potentials $p_s$ and $\sigma_{s,t}$;

2. Global inference: based on the potentials, compute 
the marginal distribution of estimators at each sensor. 

We now discuss the details of these two stages. 
3.1. Local Training for Potentials 

There are two ways to obtain the potential $p_s$ based on
local data. If there is a good model or prior knowledge of
the parameters of $f_s$, then we can use local data to find
$p_s$. However, in most cases, the distribution of the
parameters to be estimated is rather hard to compute
explicitly. In this case, we can simply employ a bootstrap
algorithm to re-sample the local training data for each
individual sensor, and locally train a group of estimators,
which can be used to approximate the distribution of the
estimator. Thus, $p_s$ can be specified in a parametric way
(using the sample estimators to estimate the distribution of
parameters) or in a non-parametric way (using several typical
estimators as "particles").
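
The bootstrap step can be sketched as follows for a scalar parameter. This is a minimal illustration, not the procedure used in the experiments: the estimator `fit_slope` (least squares through the origin), the synthetic data, and the parameter `n_boot` are assumptions chosen for the example.

```python
import random
import statistics


def fit_slope(points):
    """Least-squares slope of z = k x through the origin."""
    sxx = sum(x * x for x, _ in points)
    sxz = sum(x * z for x, z in points)
    return sxz / sxx


def bootstrap_potential(points, n_boot=200, seed=0):
    """Resample the local data with replacement and refit the estimator.

    Returns the bootstrapped slopes (usable as non-parametric "particles")
    together with their mean and variance (a Gaussian, parametric summary).
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        resample = [rng.choice(points) for _ in points]
        slopes.append(fit_slope(resample))
    mu = statistics.fmean(slopes)
    var = statistics.variance(slopes)
    return slopes, mu, var


# Hypothetical local data of one sensor: noisy samples of z = 2 x.
rng = random.Random(1)
xs = [rng.uniform(0.1, 1.0) for _ in range(30)]
data = [(x, 2.0 * x + rng.gauss(0.0, 0.1)) for x in xs]
particles, mu, var = bootstrap_potential(data)
```

Here `(mu, var)` plays the role of a parametric $p_s$, while `particles` is the non-parametric alternative.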

Finding $\sigma_{s,t}$ is more complicated. This is because in
many cases, $\sigma_{s,t}$ is related to the statistical
properties of the global estimator to be determined, i.e., it
might be hard to estimate locally. In this paper, we restrict
our discussion to problems where the entire system has one
unique hypothesis. Thus, $\sigma_{s,t}$ is simply predetermined as

$$\sigma_{s,t}(f_s, f_t) = \begin{cases} 1 & f_s \text{ and } f_t \text{ are the same} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

although in practice, we sometimes need to relax this peaky
function slightly. With the assumption above, we can discuss
two classes of algorithms in graphical models for our
collaborative training: message-passing (for parametrized,
low-dimensional cases) and sampling (for non-parametrized,
high-dimensional cases).

Interestingly, it can be shown that for a simple hypothesis
testing problem ($H_0$ vs. $H_1$), with the similarity function
defined as in (2), the likelihood ratio method (using the
entire data set) and the collaborative training method
(computing the likelihood based on local data, and finding the
marginal distribution by global inference) result in exactly
the same outcome, given that the data collected by the sensors
are i.i.d. To some extent, this supports the choice of our
similarity function and the optimality of the scheme.
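
One way to see this equivalence is sketched below. The sketch assumes a connected graph and takes $p_u(H_i)$ to be the local likelihood $P(D_u \mid H_i)$ of the data $D_u$ held by sensor $u$; these symbols are introduced here for illustration and do not appear in this form in the text.

```latex
% Assumptions: connected graph; p_u(H_i) = P(D_u | H_i); D = union of all D_u.
\begin{align*}
p(f_s = H_i)
  &\propto \sum_{\{f_t\}_{t \neq s}} \; \prod_{u \in \mathcal{V}} p_u(f_u)
     \prod_{(u,v) \in \mathcal{E}} \sigma_{u,v}(f_u, f_v) \\
  &= \prod_{u \in \mathcal{V}} p_u(H_i)
     && \text{(by (2), only } f_u = H_i \text{ for all } u \text{ contributes)} \\
  &= \prod_{u \in \mathcal{V}} P(D_u \mid H_i) = P(D \mid H_i)
     && \text{(independence of the data across sensors)} \\[4pt]
\frac{p(f_s = H_1)}{p(f_s = H_0)}
  &= \frac{P(D \mid H_1)}{P(D \mid H_0)}.
\end{align*}
```

Under these assumptions, the ratio of the marginals at any sensor equals the centralized likelihood ratio.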



3.2. Message-Passing for Parametrized Cases 

Message-passing is an accurate inference algorithm on
graphical models. Here we assume that the classifier/estimator
of sensor $s$ can be parametrized by a parameter $x_s$.

Then, according to the sum-product algorithm described
in [15], given the potential (1), the marginal at node $s$ is
of the form

$$p(x_s) \propto p_s(x_s) \prod_{t \in \mathcal{N}(s)} M_{ts}(x_s), \qquad (3)$$

where $M_{ts}(x_s)$ is the message from sensor $t$ to sensor $s$.
When the graph is loopless, the sum-product algorithm
converges in finitely many steps to a unique limit. However,
for arbitrary graph structure, the sum-product form might only
be an approximation. Here we directly apply the message
recursion update formula for the sum-product algorithm:

$$M_{ts}(x_s) \propto \sum_{x'_t} \sigma_{s,t}(x_s, x'_t)\, p_t(x'_t) \prod_{u \in \mathcal{N}(t) \setminus s} M_{ut}(x'_t). \qquad (4)$$

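Specifically, if we assume that the potentials are of Gaussian
form,

$$p_s(x_s) = \exp\left\{ -\frac{(x_s - \mu_s)^2}{2\sigma_s^2} \right\} \qquad (5)$$

and

$$\sigma_{s,t}(x_s, x_t) = \exp\left\{ -\frac{(x_s - x_t)^2}{2\lambda_{s,t}^2} \right\}, \qquad (6)$$

where $\lambda_{s,t}$ is close to 0 to ensure consensus, then the
message is also of a Gaussian form:

$$M_{ts}(x_s) = \exp\left\{ -\frac{(x_s - \mu_{ts})^2}{2\sigma_{ts}^2} \right\}. \qquad (7)$$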



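Therefore, message updating, as described in (4), can be
simplified to an update of the parameters $\mu_{ts}$ and
$\sigma_{ts}^2$, as specified below:

$$\mu_{ts} = \frac{\mu_t/\sigma_t^2 + \sum_{u \in \mathcal{N}(t) \setminus s} \mu_{ut}/\sigma_{ut}^2}{1/\sigma_t^2 + \sum_{u \in \mathcal{N}(t) \setminus s} 1/\sigma_{ut}^2} \qquad (8)$$

and

$$\sigma_{ts}^2 = \lambda_{t,s}^2 + \frac{1}{1/\sigma_t^2 + \sum_{u \in \mathcal{N}(t) \setminus s} 1/\sigma_{ut}^2}. \qquad (9)$$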

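After the messages converge, the marginal distribution
$p_s(x_s)$ of each sensor $s$ is still a Gaussian distribution
with parameters (written here as $\hat{\mu}_s$ and
$\hat{\sigma}_s^2$) given by

$$\hat{\mu}_s = \frac{\mu_s/\sigma_s^2 + \sum_{t \in \mathcal{N}(s)} \mu_{ts}/\sigma_{ts}^2}{1/\sigma_s^2 + \sum_{t \in \mathcal{N}(s)} 1/\sigma_{ts}^2} \qquad (10)$$

and

$$\hat{\sigma}_s^2 = \frac{1}{1/\sigma_s^2 + \sum_{t \in \mathcal{N}(s)} 1/\sigma_{ts}^2}. \qquad (11)$$

It can be shown that when $\lambda_{s,t} \to 0$ and the network
is loopless, the MAP estimate of each sensor after running
message-passing converges to the average of the $\mu_s$
weighted by $1/\sigma_s^2$, which is exactly how we combine
i.i.d. Gaussian observations with different variances. In this
sense, our scheme finds the optimal solution.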
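
The Gaussian parameter updates above can be checked numerically on a small tree. The sketch below is an illustration under stated assumptions: the function name `gaussian_bp`, the synchronous update schedule, the nearly flat message initialization, and the example network are all choices made here, not details given in the text.

```python
def gaussian_bp(mu, var, edges, lam2, n_iters=None):
    """Synchronous Gaussian message passing on an undirected graph.

    mu[s], var[s] : mean and variance of the local potential p_s
    lam2          : lambda^2, the (small) variance of the similarity potential
    Returns a list of (mean, variance) pairs, one marginal per sensor.
    """
    n = len(mu)
    nbrs = {s: [] for s in range(n)}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    # Messages M_{t->s} stored as (mean, variance); start nearly flat.
    msg = {(t, s): (0.0, 1e12) for t in nbrs for s in nbrs[t]}
    for _ in range(n_iters or n):
        new = {}
        for (t, s) in msg:
            prec = 1.0 / var[t]
            num = mu[t] / var[t]
            for u in nbrs[t]:
                if u != s:
                    m_ut, v_ut = msg[(u, t)]
                    prec += 1.0 / v_ut
                    num += m_ut / v_ut
            # Message mean and variance updates.
            new[(t, s)] = (num / prec, lam2 + 1.0 / prec)
        msg = new
    marginals = []
    for s in range(n):
        prec = 1.0 / var[s]
        num = mu[s] / var[s]
        for t in nbrs[s]:
            m_ts, v_ts = msg[(t, s)]
            prec += 1.0 / v_ts
            num += m_ts / v_ts
        marginals.append((num / prec, 1.0 / prec))
    return marginals


# Hypothetical star-shaped network with 4 sensors and unequal variances.
mu = [1.0, 2.0, 4.0, 3.0]
var = [1.0, 0.5, 2.0, 1.0]
edges = [(0, 1), (1, 2), (1, 3)]
marginals = gaussian_bp(mu, var, edges, lam2=1e-9)
```

With `lam2` close to zero, every sensor's marginal mean should approach the precision-weighted average of the local means, here $10/4.5 \approx 2.22$.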

3.3. Sampling Algorithms for High-dimensional Cases 

Message-passing for the parametrized case is straightfor- 
ward. However, when the classifiers/estimators reside in 
some high-dimensional space (very common for most learn- 
ing problems) and cannot easily be parametrized (like neu- 
ral networks and decision trees), it is difficult to update the 
parameters directly and message-passing can be unimple- 
mentable. In this case, we resort to sampling methods to 
effectively search for the optimal classifiers/estimators. 



The first problem we face is to find an expression for the
distributions of estimators $f_s$ of individual sensors.
Usually, by bootstrapping, we can obtain a group of "particles"
$\{h_{sj}\}_{j=1}^{n_s}$ of the distribution of $f_s$. If we
define a kernel $K(\cdot, \cdot)$ (non-negative), i.e., a measure
of similarity of the estimators, then we can write the
distribution of $f_s$ (up to normalization) as

$$p_s(f_s) = \sum_{j=1}^{n_s} K(h_{sj}, f_s). \qquad (12)$$

If we assume that the true hypothesis is unique, then we
need to enforce consensus among the sensors; thus we define
the similarity function as in (2).

With these assumptions, the marginal distribution of any
sensor $t$ in (1) has the form (after summing out all the other
variables)

$$p(f_t) = \prod_{s=1}^{M} \sum_{j=1}^{n_s} K(h_{sj}, f_t). \qquad (13)$$

Therefore, the accurate MAP solution for the collaboratively
trained estimator is given by

$$f^* = \arg\max_{f \in H_s} \prod_{s=1}^{M} \sum_{j=1}^{n_s} K(h_{sj}, f), \qquad (14)$$

where $M$ represents the total number of sensors and $n_s$ is
the number of "particles" bootstrapped by sensor $s$.
Moreover, for simplicity, we define

$$H_s = \{ h_{sj} \mid s \in \{1, \ldots, M\},\ j \in \{1, \ldots, n_s\} \}. \qquad (15)$$
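
A centralized evaluation of (14) can be sketched as below. The sketch assumes that each estimator is represented by its binary prediction vector, that the kernel is one minus the normalized Hamming distance, and that the search is restricted to the pooled set of particles; all three are assumptions for illustration, as are the function names.

```python
import math


def kernel(f, g):
    """Similarity of two prediction vectors: 1 - (Hamming distance) / n."""
    return 1.0 - sum(a != b for a, b in zip(f, g)) / len(f)


def map_estimator(particles):
    """Return the particle maximizing prod_s sum_j K(h_sj, f).

    particles[s] is the list of particles h_sj bootstrapped by sensor s.
    """
    candidates = [h for hs in particles for h in hs]

    def score(f):
        return math.prod(sum(kernel(h, f) for h in hs) for hs in particles)

    return max(candidates, key=score)


# Hypothetical particles of 3 sensors, each a vector of 4 binary predictions.
particles = [
    [(1, 1, 0, 0), (1, 0, 0, 0)],
    [(1, 1, 0, 0), (0, 1, 0, 1)],
    [(1, 1, 0, 1), (1, 1, 0, 0)],
]
best = map_estimator(particles)
```

In this toy example the particle `(1, 1, 0, 0)` is supported by every sensor and attains the largest score.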



It is difficult to apply an accurate inference algorithm to
solve (14), because the closed-form messages involve an
increasingly intricate product-sum of the kernels. Therefore,
we resort to sampling methods to distributively tackle
problem (14).

For simplicity, we only discuss the Gibbs Sampling case,
which, in our model, reduces to the following algorithm:

Repeat

1. Randomly select sensor $s$;

2. Fix the sample values of its neighbors $\{f_t\}_{t \in \mathcal{N}_s}$;

3. Conditioned on $\{f_t\}_{t \in \mathcal{N}_s}$, resample $f_s$ based on
the distribution of $f$

$$\left( \prod_{t \in \mathcal{N}_s} \sigma_{s,t}(f_t, f) \right) \left( \sum_{j=1}^{n_s} K(h_{sj}, f) \right),$$

where $f \in \{h_{sj} \mid 1 \le j \le n_s\} \cup \{f_t \mid t \in \mathcal{N}_s\}$;

End

Algorithm 1: Sampling Algorithm

In the algorithm, the function $\sigma_{s,t}(\cdot, \cdot)$
(non-negative) is the similarity function (a very peaky
function). However, in practice, to make the sampling algorithm
non-trivial and more forgiving, $\sigma_{s,t}$ should be relaxed,
allowing some discrepancy between its two inputs, and thus
preventing the algorithm from falling into a trivial solution.
Moreover, notice that $\mathcal{N}_s$ represents the set of
neighbors of sensor $s$, and the restriction of the space in
which $f$ resides is due to the locality constraints.
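
A minimal sketch of the resampling loop is given below, assuming estimators are represented by binary prediction vectors. The Hamming-based `kernel`, the relaxed `sigma` (a floored kernel raised to a power), and the three-sensor chain are hypothetical choices for illustration.

```python
import random


def kernel(f, g):
    """Similarity of two prediction vectors: 1 - (Hamming distance) / n."""
    return 1.0 - sum(a != b for a, b in zip(f, g)) / len(f)


def sigma(f, g, power=3):
    """Relaxed similarity potential (assumed form): floored kernel to a power."""
    return max(kernel(f, g), 1e-6) ** power


def gibbs_step(s, current, particles, neighbors, rng):
    """Resample the estimator of sensor s with its neighbors' samples fixed."""
    # Candidates: own particles plus the neighbors' current samples.
    candidates = list(particles[s]) + [current[t] for t in neighbors[s]]
    weights = []
    for f in candidates:
        w = sum(kernel(h, f) for h in particles[s])
        for t in neighbors[s]:
            w *= sigma(current[t], f)
        weights.append(w)
    current[s] = rng.choices(candidates, weights=weights, k=1)[0]
    return current[s]


def gibbs(particles, neighbors, n_rounds, seed=0):
    """Repeatedly pick a random sensor and resample its estimator."""
    rng = random.Random(seed)
    current = {s: particles[s][0] for s in particles}
    for _ in range(n_rounds):
        s = rng.choice(list(particles))
        gibbs_step(s, current, particles, neighbors, rng)
    return current


# Hypothetical chain of 3 sensors with 2 particles each.
particles = {
    0: [(1, 1, 0, 0), (1, 0, 0, 0)],
    1: [(1, 1, 0, 0), (0, 1, 0, 1)],
    2: [(1, 1, 0, 1), (1, 1, 0, 0)],
}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
result = gibbs(particles, neighbors, n_rounds=50)
```

Because a sensor may adopt a neighbor's current sample, estimators can propagate through the network even though no training data are exchanged.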

In practice, the algorithm might converge rather slowly.
In that case, we can change the random resampling (step 3)
of the algorithm into a deterministic optimization, i.e., we
can replace step 3 by

$$f_s = \arg\max_{f} \left( \prod_{t \in \mathcal{N}_s} \sigma_{s,t}(f_t, f) \right) \left( \sum_{j=1}^{n_s} K(h_{sj}, f) \right). \qquad (16)$$

This algorithm is prone to falling into a local optimum, yet
converges faster. It is almost a greedy approach to finding
the solution of (14).
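
The deterministic variant can be sketched as follows. As before, the Hamming-based `kernel`, the cubed-kernel `sigma`, and the toy network are assumptions; for reproducibility the sketch also sweeps over the sensors in a fixed order rather than selecting them at random.

```python
def kernel(f, g):
    """Similarity of two prediction vectors: 1 - (Hamming distance) / n."""
    return 1.0 - sum(a != b for a, b in zip(f, g)) / len(f)


def sigma(f, g, power=3):
    """Relaxed similarity potential (assumed form): kernel raised to a power."""
    return kernel(f, g) ** power


def greedy_step(s, current, particles, neighbors):
    """Replace the random resampling by an arg max over the same candidates."""
    candidates = list(particles[s]) + [current[t] for t in neighbors[s]]

    def score(f):
        value = sum(kernel(h, f) for h in particles[s])
        for t in neighbors[s]:
            value *= sigma(current[t], f)
        return value

    current[s] = max(candidates, key=score)


def greedy(particles, neighbors, n_sweeps=5):
    """Sweep over the sensors in a fixed order (a simplification)."""
    current = {s: particles[s][0] for s in particles}
    for _ in range(n_sweeps):
        for s in sorted(particles):
            greedy_step(s, current, particles, neighbors)
    return current


# Hypothetical chain of 3 sensors with 2 particles each.
particles = {
    0: [(1, 1, 0, 0), (1, 0, 0, 0)],
    1: [(1, 1, 0, 0), (1, 0, 0, 0)],
    2: [(1, 1, 0, 1), (1, 1, 0, 0)],
}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
consensus = greedy(particles, neighbors)
```

In this small example all sensors settle on the same particle after the first sweep; on harder instances the arg max can stall at a locally optimal configuration.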

There is one further subtle issue in implementing the
sampling algorithm. Since the kernel is defined on various
forms of classifiers/estimators, it might sometimes be hard,
say, to define the similarity/distance between a decision tree
and a neural network. If we define the prediction of estimator
$f_1$ on the training set to be a vector $\mathbf{f}_1$ and the
prediction of estimator $f_2$ on the training set to be a
vector $\mathbf{f}_2$, then we use the similarity between these
two vectors as that of the two estimators.
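
For instance, two structurally different classifiers can be compared purely through their predictions. In the sketch below, the two rules `stump` and `linear_rule`, the input points, and the agreement-based similarity are hypothetical and serve only to illustrate the idea.

```python
def predictions(estimator, inputs):
    """Prediction vector of an estimator on a fixed set of inputs."""
    return [estimator(x) for x in inputs]


def similarity(est_a, est_b, inputs):
    """Fraction of inputs on which the two estimators agree."""
    fa = predictions(est_a, inputs)
    fb = predictions(est_b, inputs)
    hamming = sum(a != b for a, b in zip(fa, fb))
    return 1.0 - hamming / len(inputs)


def stump(x):
    """A one-level decision rule on the first feature."""
    return 1 if x[0] > 0.5 else 0


def linear_rule(x):
    """A linear threshold unit on both features."""
    return 1 if 0.8 * x[0] + 0.2 * x[1] > 0.5 else 0


inputs = [(0.1, 0.9), (0.4, 0.5), (0.6, 0.0), (0.9, 0.5)]
agreement = similarity(stump, linear_rule, inputs)
```

The two rules disagree only on the third input, so `agreement` is 0.75, and nothing about their internal structure needs to be compared.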

In the collaborative training scenario, however, the entire
training set is not accessible to each individual sensor;
thus $K$ and $\sigma_{s,t}$ can only be estimated locally, i.e.,
sensor $s$ can only compute $K$ and $\sigma_{s,t}$ based on its
own data. Therefore, in the above algorithm, the kernels are
actually subscripted by the sensor that evaluates them. We
will show empirically that this local data restriction indeed
compromises the performance of the system, yet this is the
price we pay for distributed algorithms.

The sampling algorithm has the advantage that only a 
properly selected kernel (a measure of similarity among clas- 
sifiers) is required, unlike the case of message-passing, where 
we usually expect a linear Euclidean space of parameters. 
Moreover, sampling is not affected by the loops in the undi- 
rected graph model, and the sensors can update their sam- 
ples asynchronously, as long as the samples of their Markov 
blankets are fixed. 

4. EXPERIMENTS 

4.1. Message-Passing Algorithm: Linear Regression 

Assume that $m$ sensors are distributed in the domain $[0, 1]^2$.
The sensors cooperate to estimate the slope $k$ of a straight
line $z = kx$. The sensor at location $(x, y) \in [0, 1]^2$
observes a noisy version of the value of $z$ at $x$, and the
noise at point $(x, y)$ is additive and has a Gaussian
distribution of variance $\sigma^2 \sin^2(2\pi x)$. We also
assume that each sensor can query the value of observations of
its neighbors within a radius $r$.

For this problem, each sensor is capable of estimating
the global model based on its own observation and those of
its neighbors, because the slope $k$ can be estimated well
even if we observe only a small part of the straight line.
So for this consensus problem, the key step is to find the
potential/distribution of each individual sensor.
Bootstrapping is a suitable method in this case. Each sensor
$s$ simply bootstraps over its accessible data (the data of
itself and its neighbors) and uses the sample distribution to
approximate the potential $p_s$. For computational simplicity,
we parametrize these distributions as Gaussian (even though
this is not accurate) so that we can simply apply the
parameter update formula derived in the previous section.

There are 50 sensors with communication radius of 0.2
in this consensus problem, i.e., $m = 50$, $r = 0.2$ and $\lambda_{s,t}$
A typical result of the simulation is shown in Fig. 2.

Fig. 2. Performance of the collaborative training algorithm
running on the sensor network. The plot on top depicts
the decreasing trend of the test error. The plot on the
bottom demonstrates the attenuation of the variance of the
estimated parameter, indicating the speed with which the
system approaches consensus.

Note that the estimates of the slopes of different sensors
in the distributed system come to consensus rather quickly
- the variance of slopes among the sensors reduces to a
negligible level after a few rounds. On the other hand, the
average test error decreases quickly, coming very close to the
performance of centralized linear regression.

4.2. Sampling Algorithm: Decision Tree Classifiers

We select the Chess data set (King-Rook vs. King-Pawn), a
3196-instance, 36-dimension, 2-class data set from the UCI
machine learning repository, for this experiment. We
randomly select 2000 data points, evenly distributed at 20
different sensors, as the training set, and use the remaining
1196 data points as the test set. The communication topology
of the 20 sensors is a random graph of expected degree 4.
Each sensor, by bootstrapping, generates 4 classifiers (chosen
to be standard decision tree classifiers provided by MATLAB).
We define the kernel as

$$K(f_s, f_t) = 1 - \frac{\| \mathbf{f}_s - \mathbf{f}_t \|}{n}, \qquad (17)$$

where $\mathbf{f}_s$ denotes the vector of predictions of
classifier $f_s$ on all the local training data points,
$\| \cdot \|$ denotes the Hamming distance of two 2-symbol
vectors, and $n$ is the total number of local data points.
Moreover, we select $\sigma_{s,t}(\cdot, \cdot) = K(\cdot, \cdot)^3$
to make it a properly peaky function.

The sensors initialize their sampled classifiers by solving
(14) based on their individual data (i.e., they solve the
optimization problem of (14) without the product). Running
the greedy version of Algorithm 1 for 4000 rounds, we
obtain the results shown in Fig. 3 and Table 1.

Fig. 3. Histograms of test errors among the sensors before
and after running the sampling algorithm.



Data          Algorithm                       Test Error
Centralized   Centralized decision tree       .0109
Distributed   Centralized solution to (14)    .0702
Distributed   Non-collaborative training      .0941 (median)
Distributed   Sampling algorithm              .0702 (median)
Distributed   Average of all classifiers      .0669

Table 1. Test errors of different algorithms, compared with
the result of the sampling algorithm.

As shown in the results, the sampling algorithm (based
on Gibbs Sampling) enables a major portion of the sensors
in the network to find the optimal classifier with respect to
the distributed data, i.e., the centralized solution to (14),
which is much better than the results given by
non-collaborative training. A simple average of all the
bootstrapped classifiers (similar to bagging) seems to be
slightly better, yet the generated classifier is much more
complicated than the result of the sampling algorithm
(80 trees vs. 1 tree).

In this example, we have seen that our scheme of collab- 
orative training can be quite effective even for a very com- 
plex, high-dimensional space of classifiers, without trans- 
mitting any training data points. 

5. CONCLUSIONS AND DISCUSSIONS 

We have applied our scheme to both parametrized, low- 
dimensional cases and non-parametrized, high-dimensional 
cases, and accurate message-passing and approximate sam- 
pling algorithms demonstrate their efficacy for these two 
cases separately. Without directly sharing data, the sensors 
are able to reach consensus and collaboratively search for a 
classifier/estimator satisfying certain optimality properties. 

Although the "collaborative" part of our algorithms is
based on message-passing or sampling algorithms borrowed
from graphical models, another essential step of our
algorithm is local training, for which we have only briefly
resorted to bootstrapping in this paper. It is of interest to
find more detailed statistical tools to estimate these
potentials, or more specifically, the distributions of
classifiers/estimators, so that we may be able to guarantee
stronger optimality and obtain better performance.

Despite the issues and challenges described above, we 
have shown the efficacy of this framework with two differ- 
ent examples. It is worthwhile to design more delicate algo- 
rithms and to prove stronger results under this framework. 